Background: The purpose of this study was to conduct a meta-analysis on the construct 
and criterion validity of multi-source feedback (MSF) to assess physicians and surgeons in 
practice. 

Methods: In this study, we followed the guidelines for the reporting of observational studies 
included in a meta-analysis. In addition to PubMed and MEDLINE databases, the CINAHL, 
EMBASE, and PsycINFO databases were searched from January 1975 to November 2012. All 
articles listed in the references of the MSF studies were reviewed to ensure that all relevant 
publications were identified. All 35 articles were independently coded by two authors (AA, TD), 
and any discrepancies (eg, effect size calculations) were reviewed by the other authors 
(KA, AD, CV). 

Results: Physician/surgeon performance measures from 35 studies were identified. A 
random-effects model of weighted mean effect size differences (d) showed that: construct 
validity coefficients for the MSF system on physician/surgeon performance across different 
levels in practice ranged from d=0.14 (95% confidence interval [CI] 0.40-0.69) to d=1.78 
(95% CI 1.20-2.30); construct validity coefficients for the MSF on physician/surgeon 
performance on two different occasions ranged from d=0.23 (95% CI 0.13-0.33) to d=0.90 
(95% CI 0.74-1.10); concurrent validity coefficients for the MSF based on differences in 
assessor group ratings ranged from d=0.50 (95% CI 0.47-0.52) to d=0.57 (95% CI 0.55-0.60); 
and predictive validity coefficients for the MSF on physician/surgeon performance across 
different standardized measures ranged from d=1.28 (95% CI 1.16-1.41) to d=1.43 (95% CI 0.87-2.00). 

Conclusion: The construct and criterion validity of the MSF system is supported by small 
to large effect size differences based on the MSF process and physician/surgeon performance 
across different clinical and nonclinical domain measures. 

Keywords: multi-source feedback system, meta-analysis, clinical performance, construct 
validity, criterion validity 

Introduction 

One of the most widely recognized methods used to evaluate physicians and surgeons 
in practice is multi-source feedback (MSF), also referred to as a 360-degree assessment, 
where different assessor groups (eg, peers, patients, coworkers) rate doctors' clinical 
and nonclinical performance. 1 Use of MSF has been shown to be a unique form of 
evaluation that provides more valuable information than any single feedback source. 1 
MSF has gained widespread acceptance for both formative and summative assessment 
of professionals, and is seen as a trigger for reflecting on where changes in practice 
are required. 2,3 Certain characteristics of health professionals have been assessed using 
MSF, including their professionalism, communication, interpersonal relationships, and 
clinical and procedural skills competence. 4 One of the main 
benefits of MSF is that it provides physicians and surgeons 
with information about their clinical practice that may help 
them in improving and monitoring their performance. 5 

The number of published studies on the use of MSF to 
assess health professionals in clinical practice has increased 
substantially. In a recent systematic review of the 
impact of workplace-based assessment on doctors' education 
and performance, Miller and Archer 6 reported evidence 
supporting the use of MSF in that it has the potential to lead to 
improvement in clinical performance. Risucci et al 7 
demonstrated concurrent validity for MSF in surgical residents 
by showing a medium effect size correlation coefficient 
between MSF scores and American Board of Surgery 
In-Training Examination (ABSITE) scores. When using MSF 
with residents at different levels in their program, Archer 
et al 8 showed modest increases in the performance of year 
4 in comparison with year 2 trainees, thereby 
demonstrating the construct validity of this approach to assessment. 
Violato et al 9 compared changes in physician performance 
from time 1 to time 2 (a 5-year interval) using total scores 
given by medical colleagues and coworkers on the MSF 
questionnaire and demonstrated a significant improvement in 
performance over time. Although MSF has been used 
in a variety of contexts, the research focus has varied, covering 
measures across years in programs, differences between assessor 
groups, and comparisons with other assessment methods, so 
the validity of MSF needs to be investigated further. 

The main purpose of this study was to conduct a 
meta-analysis by identifying all published empirical data on the 
use of MSF to assess physicians' clinical and nonclinical 
performance. We conducted a meta-analysis on the construct 
and criterion (predictive or concurrent) validity of the MSF 
system based on summary effect sizes, their 
95% confidence intervals (CIs), and interpretation of the 
magnitude of these coefficients. 

Materials and methods 

Selection of studies 

In the present study, we followed the guidelines for reporting 
of observational studies included in a meta-analysis. 10 In 
addition to PubMed and MEDLINE, the CINAHL, EMBASE, 
and PsycINFO databases were searched from January 1975 
to November 2012. We also manually searched the reference 
lists for further relevant studies. The following terms were 
used in the search: "multi-source feedback", "360-degree 
evaluation", and "assessment of medical professionalism". 
Studies were included if: they used at least one MSF 
instrument (eg, self, colleague, coworker, and/or patient) 
to assess physician/surgeon performance in practice; they 
described the MSF instrument or its design; they described 
factors measured by the MSF instrument; they provided 
evidence of construct-related and/or criterion-related 
validity (predictive/concurrent); and they were published in an 
English-language, peer-reviewed journal. The main reason 
for restricting the search to refereed journals was to ensure 
that only studies of high quality were included in the 
meta-analysis. We excluded studies if they used 
nonmedical health professionals, did not provide a 
description or breakdown of what the MSF instrument was 
measuring, did not provide empirical data on MSF results, reported 
data on feasibility and/or reliability only, and/or focused on 
performance changes after receiving MSF feedback. 

Data extraction 

The initial search yielded 1,137 papers, as shown in Figure 1 . 
Of these, 623 papers were excluded based on the title, 292 
were excluded based on a review of the abstract, 97 were 
removed as they were duplicates, and a further 90 were 
eliminated after a review of the full-text versions. Finally, 
we agreed on a total of 35 papers to be included for 
meta-analysis. A coding protocol was developed that included each 
study's title, author(s) name(s), year of publication, source of 
publication, study design (ie, construct or criterion validity 
study), physician/surgeon specialty (eg, general practice, 
pediatrics), and types of raters (ie, self, medical colleague, 
consultants, patients, and coworkers). All 35 articles were 
independently coded by two authors (AA and TD) and any 
discrepancies (eg, effect size calculations) were reviewed by 
a third author (KA, AD, or CV). Based on iterative reviews 
and discussions between the five coders, we were able to 
achieve 100% agreement on all coded data. 
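
The screening counts reported above can be checked with simple arithmetic. The sketch below only restates the numbers given in the text; the variable names are illustrative and not part of the original study.

```python
# Counts reported in the text for the initial search and each exclusion stage.
initial_papers = 1137
excluded = {
    "title": 623,        # excluded based on the title
    "abstract": 292,     # excluded after review of the abstract
    "duplicates": 97,    # removed as duplicates
    "full_text": 90,     # eliminated after full-text review
}

# Papers remaining after all exclusion stages.
remaining = initial_papers - sum(excluded.values())
print(remaining)  # 35
```

The result agrees with the 35 papers included in the meta-analysis.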

Statistical analysis 

The statistical analysis of all effect size calculations was done 
using the Comprehensive Meta-Analysis software program 
(version 1.0.23, Biostat Inc, Englewood, NJ, USA). Most of 
the studies reported mean differences (Cohen's d) between 
MSF scores as effect size measures. However, there were 
some studies that reported the Pearson's product-moment 
correlation coefficient (r). For these studies, and in order 
to preserve consistency in the data that were reported, 
r was converted to Cohen's d using the following formula: 
$d = 2r/\sqrt{1 - r^{2}}$. 11 
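
As a minimal sketch of this conversion, assuming the standard form d = 2r/sqrt(1 - r^2), the following Python function converts a correlation coefficient to Cohen's d. The function name `r_to_d` is illustrative and does not come from the software used in the study.

```python
import math


def r_to_d(r: float) -> float:
    """Convert a Pearson correlation r to Cohen's d via d = 2r / sqrt(1 - r^2)."""
    if not -1.0 < r < 1.0:
        # The formula is undefined at |r| = 1 (division by zero).
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / math.sqrt(1.0 - r * r)


# Hypothetical example: a correlation of 0.60 corresponds to d of about 1.5.
print(round(r_to_d(0.60), 2))
```

Note that the conversion is nonlinear, so moderate correlations map to large values of d.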

We selected MSF domains or subscale measures as 
the variables of interest and either contrasted these scores 
between assessor groups (eg, different personnel ratings, 
in-training year, or postgraduate year of practice) or with 
other measures of clinical performance competencies (eg, 
ABSITE or Objective Structured Clinical Examination 
[OSCE]). 

Figure 1 Selection of studies for the meta-analysis. 

Because we combined results from studies that used different 
research designs (eg, different physician year in practice), 
different personnel ratings (eg, medical colleagues, 
coworkers, patients), or different methods of analysis between assessor 
groups (eg, MSF in comparison with ABSITE, as well as 
an objective structured practical examination [OSPE]), we 
used a random-effects model in combining the unweighted 
and weighted effect sizes. The fixed-effects model assumes 
that the summary effect size differences are the same from 
study to study (eg, use of MSF with different questionnaires). 
In contrast, the random-effects model allows for 
between-study variance and therefore gives a more conservative 
estimate of the participants' performance measures. 12 
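
To illustrate how a random-effects summary differs from a fixed-effects one, the sketch below pools effect sizes with the DerSimonian-Laird estimator. This is an assumption made for illustration: the text does not state which estimator the software applied, and the function names, the variance approximation for d, and the input values are all hypothetical.

```python
import math


def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))


def random_effects_pool(effects, variances):
    """Return (pooled d, 95% CI, tau^2) using the DerSimonian-Laird method."""
    k = len(effects)
    w = [1.0 / v for v in variances]              # fixed-effects weights
    sw = sum(w)
    fixed = sum(wi * di for wi, di in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effects mean.
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * di for wi, di in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), tau2


# Hypothetical inputs: three effect sizes with assumed group sizes.
effects = [0.30, 0.60, 1.20]
variances = [d_variance(d, 40, 40) for d in effects]
pooled, ci, tau2 = random_effects_pool(effects, variances)
print(round(pooled, 2), tuple(round(x, 2) for x in ci), round(tau2, 3))
```

When the between-study variance estimate is greater than zero, the random-effects weights are more equal across studies and the confidence interval is wider than under a fixed-effects model, which is the sense in which the estimate is more conservative.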

In this meta-analysis, residents in different years of 
rotation and the attending physicians/surgeons were treated 
equally in that they represent treating physicians at different 
stages of practice. We therefore evaluated 
the performance of these "physicians/surgeons" as having 
a more or less similar trajectory in achieving clinical 
competency, as measured by their performance on the 
multi-source feedback system. 

To assess the heterogeneity of effect sizes, forest 
plots with Cochran Q tests were used. Absence of a 
significant P-value for Q may indicate low power within studies 
rather than actual consistency or homogeneity across 
studies included in the meta-analysis. In addition, the 
distribution of the studies in the forest plots was an important 
visual indicator of the consistency between studies. 
Interpretation of the magnitude of the effect size for both 
mean differences and correlations is based on Cohen's 13 
suggestions, ie, d=0.20-0.49 is "small", d=0.50-0.79 is 
"medium", and d≥0.80 is considered to be a "large" effect 
size difference. 
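
The heterogeneity statistic and the magnitude labels described above can be sketched as follows. The function names are illustrative, the label for values under 0.20 is an assumption (the text gives no name for that range), and a value of exactly 0.80 is treated here as "large".

```python
def cochran_q(effects, variances):
    """Return Cochran's Q and its degrees of freedom (k - 1)."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    return q, len(effects) - 1


def cohen_label(d):
    """Classify the magnitude of an effect size using Cohen's thresholds."""
    size = abs(d)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "below small"  # assumed label; not named in the text


# Effect sizes quoted in the abstract, labeled by magnitude.
for d in (0.14, 0.50, 1.78):
    print(d, cohen_label(d))
```

Q is compared against a chi-square distribution with k - 1 degrees of freedom; the P-value calculation is omitted here to keep the sketch free of external libraries.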

Results 

The characteristics of the 35 studies included in the 
meta-analysis were based on four groups (Table 1) that reported 
contrasts between different physician years in practice 
(group A), differences between physician performance levels 
on two occasions (group B), rating differences between self, 
medical colleague, coworker, and patients (group C), and 
comparisons between MSF and other measures of 
performance (group D). The reported MSF domain measure (ie, 
items 1 through 5) and the corresponding unweighted effect 
sizes based on either the contrast or comparison variables 
are presented in Table 1. Different approaches to testing 
the validity of MSF were demonstrated by studies included 
in this meta-analysis. In groups A and B, we investigated 
the construct validity of the domains' measures of MSF by 
showing that physicians at different levels of experience 
or on two separate occasions tend to obtain higher clinical 
performance scores. In groups C and D, the criterion validity 
of MSF is compared with other similar assessments of 
clinical performance or different raters as either a concurrent or 
predictive validity measure. 

The sample size of the studies ranged from six plastic 
surgery residents 14 to 577 pediatric residents 15 who had 
been assessed using MSF with as few as 1.2 patients and 
2.6 medical colleagues 16 and as many as 47.3 patients 
completing forms per individual. 17 Questionnaire items used as 
part of MSF ranged from as few as four items 18 to as many 
as 60 items 14 per questionnaire. Information on specific 
demographic characteristics, such as participants' sex or age, 
was not reported, but level of training and years of practice 
as a physician were typically identified. In each study, the 
unweighted mean effect size difference (Cohen's d) was 
provided or calculated based on the MSF domain measures 
as a contrasting variable (eg, years spent as a physician in 
practice) or with a comparison measure (eg, OSPE). 
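
Where a primary study reports group means and standard deviations rather than an effect size, Cohen's d can be calculated as in the sketch below. It assumes the common pooled-standard-deviation definition; the text does not specify which variant was applied, and the function name and input values are hypothetical.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


# Hypothetical MSF ratings on a 5-point scale for two groups of 30 raters each.
d = cohens_d(mean1=4.5, sd1=0.5, n1=30, mean2=4.0, sd2=0.5, n2=30)
print(round(d, 2))
```

With equal standard deviations, d is simply the mean difference expressed in standard deviation units.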

Construct validity of MSF system 

Of the 35 studies that reported data on physician/surgeon 
performance, 31 (88%) demonstrated results in support 
of the construct validity of the MSF system. As shown 
in Table 2, we combined five of the studies (group A) 
to show that for each of the five MSF domains the effect 
size differences in performance between a year of practice 
(eg, change in performance as a function of post-graduate 
year 1 to year 2, Senior House Officer to Specialist 
Registrar) 8,15,19-21 ranged from d=0.14 (95% CI 0.40-0.69) 
for manager skills to d=1.78 (95% CI 1.20-2.30) for 
communication skills. 

When differences between physician/surgeon 
performance were investigated on two different occasions, 
we found four studies (group B) that showed differences 
in clinical performance across the five domain scores of 
MSF. In particular, Brinkman et al 19 compared ratings for 
36 pediatric residents on two occasions with regard to the 
professionalism and communication skills domains, and 
their results showed that there were consistently large effect 
size differences between time 1 and time 2. The ratings on 
these MSF items ranged from d=1.31 for the 
professionalism domain to d=2.00 for the communication skills domain. 
Correspondingly, Lockyer et al 22 found a range of MSF scores 
that varied from d=0.01 for physicians over a 5-year period 
on the professionalism, communication skills, and 
management domains for self-rating assessment to d=0.66 with the 
same physicians over the professionalism, communication 
skills, and interpersonal relationship domains as rated by 
medical colleagues. Violato et al 9 reported a small effect 
size of d=0.46 when the performance of 250 family 
physicians was compared after a 5-year interval between MSF 
assessments. 

Criterion (predictive/concurrent) validity of the MSF system 

In group C, we combined the outcomes in 21 (60%) studies 
that investigated the differences in MSF scores provided 
by different raters (eg, residents, self, medical colleague, 
coworker, patients) across the five domains identified. 
Effect size differences in performance between the different 
Table 1 Characteristics of MSF studies with construct and criterion (concurrent/predictive) validity effect size measures 

| Study source | Group | Contrast† | MSF domain‡ | Effect size difference (d_UWM§) |
|---|---|---|---|---|
| Archer et al 20 | A | SPRS (MC)/SHO (MC) | 2 and 5 | 1.22 |
| Sample size, 112 pediatrics (20 specialist registrars, 92 senior house officers) | | | | |
| Total forms = 921 | | | | |
| Brinkman et al 19 | A | Feedback (MC)/No-feedback (MC) | 1, 2, and 3 | 1.8 |
| Sample size, 36 pediatric residents (16 with feedback and 16 with no feedback) | | | | |
| Total forms = 1,263 | | | | |
| Massagli and Carline 21 | A | PGY2/PGY3 | 1, 2, 4, and 5 | 0.05 |
| Sample size, 56 rehabilitation residents | | PGY2/PGY4 | 1, 2, 4, and 5 | 0.17 |
| (nine PGY2, nine PGY3, nine PGY4) | | PGY3/PGY4 | 1, 2, 4, and 5 | 0.23 |
| Total forms = 930 | | | | |
| Archer et al 8 | A | Foundation year 1 (MC)/Foundation year 2 (MC) | 2 and 5 | 0.34 |
| Sample size, 553 multiple specialties residents (219 Foundation year 1, 334 Foundation year 2) | | | | |
| Total forms = 5,544 | | | | |
| Archer et al 15 | A | SPRS year 2 (MC)/SPRS year 4 (MC) | 2 and 5 | 0.29 |
| Sample size, 577 pediatric (343 SPRS year 2, 201 SPRS year 4, 10 pediatricians in years 1, 3, 5, 6) | | | | |
| Total forms = 4,770 | | | | |
| Wood et al 18 | B | ObGyn time 1/ObGyn time 2 | 4 and 5 | 2.41 |
| Sample size, 67 obstetrics and gynecology residents | | | | |
| Total forms = 578 | | | | |
| Lockyer et al 22 | B | Phys time 1/Phys time 2 (Self) | 1, 2, 3, and 4 | 0.46 |
| Sample size, 250 family physicians | | | | |
| Total forms = 500 | | | | |
| Brinkman et al 19 | B | Nurse time 1 (CW)/Nurse time 2 (CW) | 1 and 2 | 1.31 |
| Sample size, 36 pediatric residents | | (Parents) time 1/(Parents) time 2 | 1 and 2 | 2.00 |
| Total forms = 1,263 | | | | |
| Violato et al 9 | B | Phys time 1/Phys time 2 (MC) | 1, 2, and 5 | 0.66 |
| Sample size, 250 family physicians | | Phys time 1/Phys time 2 (CW) | 1 and 3 | 0.22 |
| Total forms = 20,500 | | Phys time 1/Phys time 2 (Patients) | 1, 3, and 4 | 0.01 |
| Risucci et al 7 | C | Self/Peer (MC) | 1, 2, and 5 | 0.56 |
| Sample size, 32 surgical residents | | Self/Supervisors (MC) | 1, 2, and 5 | 0.21 |
| Total forms = 1,024 | | Peer (MC)/Supervisors (MC) | 1, 2, and 5 | 0.25 |
| Wenrich et al 41 | C | Nurse (CW)/Phys (MC) medical knowledge | 2 and 5 | 0.51 |
| Sample size, 318 internal medicine physicians | | Nurse (CW)/Phys (MC) humanistic | 2 and 5 | -0.4[illegible] |
| Total forms = 1,877 | | | | |
| Lelliott et al 42 | C | Self/MC | 2, 3, and 5 | 0.47 |
| Sample size, 347 psychiatrists | | Patients/MC | 2, 3, and 5 | 0.85 |
| Total forms = 11,426 | | | | |
| Violato et al 43 | C | Self/MC | 1, 2, 4, and 5 | 0.58 |
| Sample size, 28 family physicians | | Self/Patients | 1, 2, 4, and 5 | 0.95 |
| Total forms = 170 | | Self/CW | 1, 2, 3, and 5 | 0.77 |
| Hall et al 3 | C | Self/Patients | 1, 2, 3, 4, and 5 | 1.30 |
| Sample size, 295 multiple specialties physicians | | Self/MC | 1, 2, and 5 | 0.37 |
| Total forms = 11,665 | | Self/Consultant (MC) | 1, 2, and 5 | 0.80 |
| | | Self/Referring physicians (MC) | 1, 2, and 5 | 1.18 |
| | | Self/CW | 1, 2, 3, and 5 | 0.76 |
| | | Consultant (MC)/MC | 1, 2, and 5 | 0.46 |
| | | Consultant (MC)/CW | 1, 2, 3, and 5 | 0.18 |
| Thomas et al 44 | C | MC (Intern)/MC | 2 and 5 | 0.41 |
| Sample size, 16 internal medicine residents | | MC (Intern)/CW | 2 and 5 | 1.06 |
| Total forms = 177 | | MC/CW | 2 and 5 | 0.65 |
| Lipner et al 45 | C | MC/Patients | 1, 2, and 3 | 2.60 |
| Sample size, 356 internal medicine physicians | | | | |
| Total forms = 12,460 | | | | |

(Continued) 
Table 1 (Continued) 

| Study source | Group | Contrast† | MSF domain‡ | Effect size difference (d_UWM§) |
|---|---|---|---|---|
| Violato et al 5 | C | Self/MC | 1, 2, 3, and 5 | 0.62 |
| Sample size, 252 surgeons | | Self/CW | 1, 2, 3, and 5 | 0.61 |
| Total forms = 7,237 | | Self/Patients | 1, 2, 3, 4, and 5 | 0.58 |
| | | MC/CW | 1, 2, 3, and 5 | 0.00 |
| | | MC/Patients | 1, 2, 3, and 5 | 0.00 |
| | | CW/Patients | 3, 4, and 5 | 0.00 |
| Wood et al | C | Patients/MC | 1 and 3 | 0.98 |
| Sample size, 7 radiology residents | | Patients/CW | 1 and 3 | 1.31 |
| Total forms = 57 | | MC/CW | 1 and 3 | 0.04 |
| Joshi et al 46 | C | MC/CW | 3 and 5 | 1.34 |
| Sample size, [illegible] | | MC/Patients | 3 and 5 | 0.43 |
| Total forms = 512 | | CW/Patients | 3 and 5 | 0.97 |
| Lockyer et al 47 | C | MC/Patients | 1, 2, and 3 | 0.06 |
| Sample size, 197 anesthesiology physicians | | | | |
| Total forms = 5,957 | | | | |
| Violato et al 48 | C | Self/MC | 1, 2, and 3 | 0.04 |
| Sample size, 100 pediatric physicians | | Self/CW | 1, 2, 3, and 5 | 0.18 |
| Total forms = 3,963 | | Self/Patients | 1, 2, 3, and 4 | 0.07 |
| | | MC/CW | 1, 2, 3, and 5 | 0.97 |
| | | MC/Patients | 1, 2, 3, and 4 | 0.79 |
| | | CW/Patients | 1, 3, 4, and 5 | 0.26 |
| Violato et al 32 | C | Self/MC | 1, 2, and 4 | 0.83 |
| Sample size, 101 psychiatry physicians | | Self/CW | 1, 2, 3, 4, and 5 | 1.52 |
| Total forms = 4,069 | | Self/Patients | 1, 2, 3, and 4 | 1.13 |
| | | MC/CW | 1, 2, 3, and 5 | 0.68 |
| | | MC/Patients | 1, 2, 3, and 4 | 0.28 |
| | | CW/Patients | 1, 3, 4, and 5 | 0.40 |
| Archer et al 8 | C | (Consultant) MC/(Resident) MC | 2 and 5 | 0.37 |
| Sample size, 553 multiple specialties residents | | | | |
| Total forms = 5,544 | | | | |
| Pollock et al 14 | C | CW/MC | 1, 2, 3, 4, and 5 | 0.87 |
| Total forms = 240 | | | | |
| Davies et al 40 | C | Consultant (MC)/CW | 2 and 4 | 0.98 |
| Sample size, 92 histopathology residents | | | | |
| Total forms = 1,012 | | | | |
| Campbell et al 33 | C | Patients/MC | 1, 2, 3, and 5 | 0.19 |
| Sample size, 291 multiple specialties physicians | | | | |
| Total forms = 18,023 | | | | |
| Meng et al 34 | C | Nurse (CW)/Secretaries (CW) | 1, 3, and 5 | 0.16 |
| Sample size, 15 anesthesiology residents | | Nurse (CW)/Nurse aids (CW) | 1, 3, and 5 | 0.64 |
| Total forms = 429 | | Nurse (CW)/Technicians (CW) | 1, 3, and 5 | 0.65 |
| | | Secretaries (CW)/Nurse aids (CW) | 1, 3, and 5 | 0.16 |
| | | Secretaries (CW)/Technicians (CW) | 1, 3, and 5 | 0.46 |
| | | Nurse aids (CW)/Technicians (CW) | 1, 3, and 5 | 0.00 |
| Lockyer et al 35 | C | Self/MC | 1, 2, and 5 | 0.22 |
| Sample size, 101 pathologists/laboratory physicians | | Self/Referring physicians (MC) | 1, 2, 4, and 5 | 0.58 |
| Total forms = 808 | | Self/CW | 1, 2, 3, and 5 | 0.18 |
| | | MC/Referring physicians (MC) | 1, 2, 4, and 5 | 0.38 |
| | | MC/CW | 1, 2, 3, and 5 | 0.03 |
| | | Referring physicians (MC)/CW | 1, 2, 3, and 4 | 0.40 |

(Continued) 
Table 1 (Continued) 

| Study source | Group | Contrast† | MSF domain‡ | Effect size difference (d_UWM§) |
|---|---|---|---|---|
| Lockyer et al 36 | C | Self/MC | 1, 2, and 4 | 0.78 |
| Sample size, 187 emergency medicine physicians | | Self/CW | 1, 2, 4, and 5 | 0.93 |
| Total forms = 6,889 | | Self/Patients | 1, 2, 3, 4, and 5 | 1.13 |
| | | MC/CW | 1, 2, 4, and 5 | 0.43 |
| | | MC/Patients | 1, 2, 3, 4, and 5 | 0.63 |
| | | CW/Patients | 1, 2, 3, and 5 | 0.17 |
| Archer et al 15 | C | Consultant (MC)/Resident (MC) | 2 and 5 | 0.64 |
| Sample size, 577 pediatric residents | | | | |
| Total forms = 4,770 | | | | |
| Chandler et al 16 | C | Self/Attending (MC) | 3 and 5 | 0.87 |
| Sample size, 66 pediatrics residents | | Self/CW | 3 and 5 | 1.10 |
| Total forms = 823 | | Self/Patients | 3 and 5 | 0.08 |
| | | Attending (MC)/CW | 3 and 5 | 0.26 |
| | | Attending (MC)/Patients | 3 and 5 | 0.30 |
| | | CW/Patients | 3 and 5 | 0.45 |
| Campbell et al | C | Patients/MC | 1, 2, 3, and 5 | 0.02 |
| Sample size, 179 family physicians | | | | |
| Total forms = 10,895 | | | | |
| Archer and McAvoy 37 | C | Patients/MC | 2 and 5 | 1.90 |
| Sample size, 68 different specialties physicians | | Assessor nominated by physicians/assessors nominated by referring body | 2 and 5 | 1.91 |
| Total forms = 2,365 | | | | |
| Overeem et al 38 | C | MC/Patients | 1, 2, 3, 4, and 5 | 0.44 |
| Sample size, 146 multiple specialties physicians | | MC/CW | 1, 2, 3, and 4 | 0.75 |
| Total forms = 3,648 | | CW/Patients | 1, 2, 3, and 5 | 0.45 |
| Lockyer et al 39 | C | Self/MC | 1, 2, 3, and 4 | 1.11 |
| Sample size, 216 surgeons | | Self/CW | 1, 2, and 3 | 0.86 |
| Total forms = 9,072 | | Self/Patients | 1, 2, 3, 4, and 5 | 1.00 |
| | | MC/CW | 1, 2, 3, and 4 | 0.44 |
| | | MC/Patients | 1, 2, 3, 4, and 5 | 0.30 |
| | | CW/Patients | 3, 4, and 5 | 0.21 |
| Qu et al 23 | C | Self/Attending (MC) | 1 and 3 | 0.30 |
| Sample size, 258 multiple specialties residents | | Self/MC | 1 and 3 | 0.13 |
| Total forms = 4,128 | | Self/CW | 1 and 3 | -0.55 |
| | | Self/Patients | 1, 2, 3, 4, and 5 | 0.19 |
| | | Self/Office staff (CW) | 1 and 3 | 1.78 |
| | | Attending (MC)/MC | 1 and 3 | 0.08 |
| | | Attending (MC)/CW | 1 and 3 | 0.82 |
| | | Attending (MC)/Patients | 1, 2, 3, 4, and 5 | 0.38 |
| | | Attending (MC)/Office staff (CW) | 1 and 3 | 2.31 |
| | | Patients/Office staff (CW) | 1, 2, 3, 4, and 5 | 1.87 |
| | | Patients/MC | 1, 2, 3, 4, and 5 | 0.37 |
| | | Patients/CW | 1, 2, 3, 4, and 5 | 0.42 |
| Lockyer et al 49 | C | Self/MC | 1 and 2 | 0.22 |
| Sample size, 37 general practice physicians | | Self/CW | 1, 2, and 3 | 0.05 |
| Total forms = 1,130 | | Self/Patients | 1, 2, 3, and 4 | 0.04 |
| | | MC/CW | 1, 2, and 3 | 0.22 |
| | | MC/Patients | 1, 2, 3, and 4 | 0.21 |
| | | CW/Patients | 1, 3, and 4 | 0.00 |
| Risucci et al 7 | D | MSF/ABSITE | 1, 2, and 5 | 1.45 |
| Sample size, 32 surgical residents | | | | |
| Total forms = 1,024 | | | | |
| Wood et al 27 | D | MSF (PT)/global examination | 1 and 3 | 1.96 |
| Sample size, 7 radiology residents | | MSF (MC)/global examination | 1 and 3 | 1.02 |
| Total forms = 57 | | MSF (CW)/global examination | 1 and 3 | 1.60 |

(Continued) 
Table 1 (Continued) 

| Study source | Group | Contrast† | MSF domain‡ | Effect size difference (d_UWM§) |
|---|---|---|---|---|
| Davies et al 40 | D | MSF (PATH-SPRAT)/OSPE | 2 and 3 | 1.09 |
| Sample size, 92 histopathology residents | | | | |
| Total forms = 1,012 | | | | |
| Yang et al 24 | D | MSF/small scale OSCE | 1, 2, and 3 | 0.79 |
| Sample size, 245 multiple specialties residents | | MSF/small scale OSCE + DOPS | 1, 2, and 3 | 2.07 |
| Total forms = 1,053 | | | | |

Notes: †A, predictive validity (physicians in different year levels); B, predictive validity (physicians' performance on MSF on two occasions separated by time); C, concurrent 
validity (differences in personnel ratings); D, construct validity (comparing MSF with standardized measures). ‡MSF domains consist of the following: 1= professionalism, covering 
psychosocial skills, psychosocial management, humanistic qualities, compassion, attitude, professional development, teaching, and professional responsibilities and professional 
managements; 2= clinical competence, covering clinical care, good medical practice, patient care, safe practice, clinical performance, knowledge, critical thinking, diagnosis, 
and management of complex problems; 3= communication, covering communication with staff and interpersonal communication skills; 4= management, covering reporting, 
self-management, administrative skills, office personnel, access to doctor, practice process, physical office, and physical space; and 5= interpersonal relationships, covering 
relationships with patients, colleagues, family members, collegiality, collaboration, patient education, information provision, and patient interaction. Two of the authors (AA, TD) 
agreed on the names of the main five domains and agreed on the items included. §d_UWM refers to the unweighted mean effect size difference as defined by Cohen's d. 
Abbreviations: CW, coworkers; MC, medical colleagues; MSF, multi-source feedback; PGY, postgraduate year; SPRS, specialist registrar; Phys, family physician; ObGyn, obstetrics 
and gynecology; ABSITE, American Board of Surgery In-Training Examination; PATH-SPRAT, Pathology-Sheffield Peer Review Assessment Tool; OSPE, Objective 
Structured Practical Examination; OSCE, Objective Structured Clinical Examination; DOPS, Direct Observation of Procedural Skills; SHO, senior house officer; PT, patients. 



raters (eg, comparison of patients with self assessment, 
medical colleagues to coworkers) ranged from d=0.50 
(95% CI 0.47-0.52) for interpersonal relationships to d=0.57 
(95% CI 0.55-0.60) for both professionalism and clinical 
competence. Most of the studies in group C showed that 
physicians consistently rated themselves lower than did other 
assessor groups. However, in a study of 258 residents within 
different specialties reported by Qu et al, residents on 
self-assessments rated themselves higher than did other raters. 23 
As shown in the forest plot (Figure 2), the combined 
random-effects size calculation for the professionalism domain was 
"medium" (d=0.66, 95% CI 0.44-0.69). 

In group D (Table 3), of the 35 studies included in the 
meta-analysis, four reported data on physician/surgeon 
performance on MSF in comparison with other criterion 
measures (eg, OSPE, OSCE). The mean effect size 
differences were found to be "medium" to "high" across each 
of the five domains identified on MSF. Effect size 
differences in performance between domain scores and other 
examination measurement scores ranged from d=1.28 
(95% CI 1.15-1.41) for clinical competence to d=1.43 
(95% CI 0.87-2.00) for interpersonal relationships. Yang 
et al 24 found a range of MSF scores that varied from d=0.79 
for residents on the domains of professionalism, clinical 
competence, and communication skills to d=2.01 with the 
same physicians on the same domains when their MSF 
scores were compared with other clinical performance 
measures such as the OSCE. 

Although the Cochran Q test shows significant 
heterogeneity between the studies included in the four groups, a 
subgroup analysis to determine the potential differences as a 
result of moderator variables such as physician/surgeon sex or 
age was limited by the data reported across the primary 
studies included in the meta-analysis. Nevertheless, the studies 
were weighted by their respective sample sizes, and the 
random-effects model analysis (with wider 95% CIs) 
provides a more conservative estimate of the combined effect 
sizes as illustrated by a forest plot (Figure 2). 



Table 2 Random effects model (Cohen's d) of the MSF domains with different physician years (group A)/different physician performance 
on two occasions (group B) 

| MSF domain measure | Studies included (number of outcomes) | Sample size | MSF with different physician years* | Studies included (number of outcomes) | Sample size | Difference between physicians' performance on two occasions** |
|---|---|---|---|---|---|---|
| Professional | 2 (4) | 126 | 0.56 (0.39-1.59) | 3 (6) | 1,054 | 0.65 (0.30-1.00) |
| Clinical competence | 5 (7) | 1,335 | 0.62 (0.25-1.00) | 3 (4) | 554 | 0.99 (0.53-1.45) |
| Communication | 1 (1) | 72 | 1.78 (1.22-2.34) | 2 (3) | 750 | 0.23 (0.02-0.48) |
| Manager | 1 (3) | 54 | 0.14 (0.40-0.69) | 3 (3) | 567 | 0.92 (0.01-1.84) |
| Interpersonal relationships | 4 (6) | 1,263 | 0.42 (0.16-0.67) | 2 (2) | 317 | 1.50 (0.19-3.22) |

Notes: *Effect sizes combined for physicians in different year levels (different PGY level, eg, year 1, year 2, senior house officer, specialist registrar); 8,15,19-21 **effect sizes 
combined for physicians' performance on two occasions separated by time (eg, 5 years, 7 months, 7 years). 9,18,19,22 
Abbreviations: MSF, multi-source feedback; PGY, post graduate year. 
Study source* | Rater comparison | Sample size | Weighted mean difference (95% CI)
Campbell et al |  |  | 0.19 (0.027 to 0.35)
Campbell et al | Pt/MC |  | 0.02 (-0.18 to 0.23)
Hall et al | Const(MC)/CW |  | 0.18 (0.02 to 0.34)
Hall et al | Const(MC)/MC |  | 0.46 (0.30 to 0.62)
Hall et al | Self/Const(MC) |  | 0.80 (0.63 to 0.97)
Hall et al | Self/CW |  | 0.76 (0.59 to 0.93)
Hall et al | Self/MC |  | 0.37 (0.21 to 0.53)
[illegible] | [illegible] |  | 1.30 (1.12 to 1.48)
[illegible] | Self/RefPhys(MC) |  | 1.18 (1.00 to 1.35)
[illegible] |  |  | 2.60 (2.40 to 2.80)
Lockyer et al |  |  | 0.06 (-0.14 to 0.26)
Lockyer et al |  |  | 0.18 (-0.10 to 0.48)
Lockyer et al | RefPhys(MC)/CW |  | 0.40 (0.12 to 0.68)
Lockyer et al | [illegible] |  | 0.58 (0.29 to 0.86)
[illegible] |  |  | 0.03 (-0.25 to 0.31)
Lockyer et al | RefPhys(MC)/MC |  | 0.38 (0.10 to 0.66)
Lockyer et al | Self/MC |  | [illegible]
Lockyer et al 36 |  | 374 | 0.17 (-0.03 to 0.37)
Lockyer et al 36 | MC/CW | 374 | 0.43 (0.22 to 0.63)
Lockyer et al 36 | MC/Pt |  | 0.63 (0.42 to 0.84)
Lockyer et al | Self/CW | 374 | 0.93 (0.71 to 1.14)
Lockyer et al 36 | Self/MC | 374 | 0.78 (0.57 to 0.99)
Lockyer et al 36 | Self/Pt | 374 | 1.13 (0.91 to 1.35)
Lockyer et al | MC/CW |  | 0.44 (0.25 to 0.63)
Lockyer et al 39 | Self/Pt | 432 | 1.00 (0.80 to 1.20)
Lockyer et al 39 | Self/MC | 432 | 1.11 (0.90 to 1.30)
Lockyer et al 39 | MC/Pt | 432 | 0.30 (0.11 to 0.49)
Lockyer et al 39 | Self/CW | 432 | 0.86 (0.66 to 1.10)
Meng et al | CW(Nu)/CW(NuA) | 30 | 0.64 (-0.13 to 1.14)
Meng et al | CW(Nu)/CW(Sec) |  | 0.16 (-0.60 to 0.90)
Meng et al | CW(Nu)/CW(Tech) |  | 0.65 (-0.12 to 1.42)
Meng et al | CW(NuA)/CW(Tech) |  | 0.00 (-0.75 to 0.75)
Meng et al | CW(Sec)/CW(NuA) |  | 0.16 (-0.60 to 0.90)
Meng et al | CW(Sec)/CW(Tech) | 30 | 0.46 (-0.30 to 1.22)
Overeem et al | MC/CW |  | 0.75 (0.51 to 0.99)
Overeem et al |  |  | 0.44 (0.21 to 0.67)
Overeem et al |  |  | 0.45 (0.22 to 0.68)
[illegible] |  |  | 0.37 (0.19 to 0.54)
Qu et al | Self/Pt | 516 | 0.19 (0.02 to 0.36)
Qu et al 23 | Self/MC | 516 | 0.13 (-0.04 to 0.30)
Qu et al 23 | Self/CW | 516 | -0.55 (-0.72 to -0.37)
Qu et al 23 | Self/Officstaff(CW) |  | 1.78 (1.57 to 1.98)
Qu et al 23 | Pt/CW | 516 | 0.42 (0.24 to 0.59)
Qu et al 23 | Pt/Officstaff(CW) |  | 1.87 (1.66 to 2.10)
Qu et al 23 | Attend(MC)/Pt | 516 | 0.38 (0.20 to 0.55)
Qu et al 23 | Attend(MC)/MC | 516 | 0.08 (-0.09 to 0.25)
Qu et al | Attend(MC)/CW | 516 | 0.82 (0.64 to 1.00)
Qu et al 23 | Attend(MC)/Officstaff(CW) | 516 | 2.31 (2.10 to 2.53)
Qu et al 23 | Attend(MC)/Self | 516 | 0.30 (0.13 to 0.47)
Risucci et al 7 | Peer(MC)/Supervisors(MC) | 64 | 0.25 (-0.26 to 0.75)
Risucci et al 7 | Self/Supervisors(MC) | 64 | 0.21 (-0.29 to 0.71)
Risucci et al 7 | Self/Peer(MC) | 64 | 0.56 (0.05 to 1.10)
Violato et al 43 | Self/MC |  | 0.58 (0.02 to 1.12)
Violato et al | Self/Pt |  | 0.95 (0.37 to 1.50)
Violato et al | Self/CW |  | 0.77 (0.20 to 1.31)
Violato et al | MC/CW |  | 0.00 (-0.17 to 0.17)
Violato et al 5 | Self/CW |  | 0.61 (0.43 to 0.79)
Violato et al 5 | Self/MC |  | 0.62 (0.44 to 0.80)
Violato et al 5 | Self/Pt |  | 0.58 (0.40 to 0.76)
Violato et al 5 |  |  | 0.00 (-0.17 to 0.17)
Violato et al 18 | CW/Pt |  | 0.83 (0.54 to 1.12)
Violato et al 18 | MC/CW |  | 0.79 (0.50 to 1.12)
Violato et al 18 | MC/Pt |  | 0.26 (-0.02 to 0.54)
Violato et al 18 | Self/CW |  | 0.07 (-0.21 to 0.35)
Violato et al 18 | Self/MC |  | 0.18 (-0.10 to 0.46)
Violato et al 18 | Self/Pt |  | 0.97 (0.68 to 1.27)
Lockyer et al 19 | CW/Pt | 74 | 0.22 (-0.25 to 0.68)
Lockyer et al 19 | MC/CW | 74 | 0.21 (-0.26 to 0.67)
Lockyer et al 19 | Self/CW | 74 | 0.22 (-0.25 to 0.68)
Lockyer et al 19 | Self/MC | 74 | 0.04 (-0.42 to 0.50)
Lockyer et al 19 | Self/Pt | 74 | 0.00 (-0.46 to 0.46)
Violato et al 32 | CW/Pt | 202 | 0.40 (0.12 to 0.68)
Violato et al 32 | MC/CW | 202 | 0.68 (0.39 to 0.96)
Violato et al 32 | MC/Pt | 202 | 0.28 (0.00 to 0.56)
Violato et al 32 | Self/CW | 202 | 1.52 (1.21 to 1.84)
Violato et al 32 | Self/MC | 202 | 0.85 (0.56 to 1.40)
Violato et al 32 | Self/Pt | 202 | 1.13 (0.83 to 1.42)
Wood et al 27 | Pt/MC |  | 0.98 (-0.33 to 2.16)
Wood et al 27 | MC/CW |  | 0.04 (-1.13 to 1.20)
Wood et al 27 | Pt/CW | 14 | 1.31 (-0.8 to 2.53)
Fixed | Combined (82) | 24,830 | 0.57 (0.55 to 0.60)
Random | Combined (82) | 24,830 | 0.56 (0.44 to 0.69)

Figure 2 Random and fixed effects model forest plots for the MSF "personnel rating differences" for professional measures.

Notes: *The effect size values are taken from the raw data reported for the outcomes in studies group C. The Cochran Q-test for heterogeneity shows significant overall 
heterogeneity between studies. 

Abbreviations: MSF, multi-source feedback; Pt, patients; MC, medical colleagues; Const, consultant; CW, co-workers; RefPhys, referring physicians; Nu, nursing; 
NuA, nursing aid; Sec, secretary; Tech, technicians; Officstaff, office staff; Attend, attending. 
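
The fixed-effect pooling summarized in Figure 2 can be illustrated with a short sketch. It assumes inverse-variance weighting and recovers each standard error from the reported 95% CI with a normal approximation; neither choice is stated in the figure, and the function names are illustrative only. The six inputs are the values reported for Violato et al 32.

```python
import math

Z_95 = 1.96  # normal critical value for a two-sided 95% interval (assumed)

# (d, lower, upper) as reported for the six Violato et al 32 outcomes in Figure 2
violato_32 = [
    (0.40, 0.12, 0.68),
    (0.68, 0.39, 0.96),
    (0.28, 0.00, 0.56),
    (1.52, 1.21, 1.84),
    (0.85, 0.56, 1.40),
    (1.13, 0.83, 1.42),
]


def se_from_ci(lower, upper, z=Z_95):
    """Approximate the standard error from the width of a symmetric CI."""
    return (upper - lower) / (2 * z)


def fixed_effect(rows):
    """Inverse-variance weighted mean of the effect sizes with a 95% CI."""
    weights = [1 / se_from_ci(lo, hi) ** 2 for _, lo, hi in rows]
    pooled = sum(w * d for w, (d, _, _) in zip(weights, rows)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return pooled, pooled - Z_95 * se, pooled + Z_95 * se


if __name__ == "__main__":
    pooled, low, high = fixed_effect(violato_32)
    print(f"pooled d = {pooled:.2f} (95% CI {low:.2f} to {high:.2f})")
```

This pools six outcomes from one study only, so it is not expected to reproduce the combined estimate over all 82 outcomes.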



Discussion 

In this meta-analysis, the MSF demonstrates evidence of
construct validity when used with physicians and surgeons
across the years of a residency program or a number of years
of practice. Physician/surgeon performance on the MSF
domains across a single year of practice showed "small" to
"large" effect size differences, with effect sizes ranging from
d=0.14 (95% CI -0.40 to 0.69) in the manager skills domain
to d=1.78 (95% CI 1.20-2.30) in the communication skills
domain.
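
As a worked illustration of the effect size metric, the sketch below computes Cohen's d from two group summaries using the pooled standard deviation and attaches a verbal label. The cutoffs of 0.2, 0.5, and 0.8 are Cohen's conventional benchmarks and are an assumption here, since the exact cutoffs applied in this analysis are not stated; the group means, standard deviations, and function names are hypothetical.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


def label(d):
    """Verbal label under Cohen's conventional benchmarks (assumed cutoffs)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below the small benchmark"


if __name__ == "__main__":
    # Hypothetical ratings on a 5-point scale for two groups of 30 physicians
    d = cohens_d(4.5, 0.5, 30, 4.0, 0.5, 30)
    print(round(d, 2), label(d))
```

Under these cutoffs, d=1.78 is labeled "large", whereas a value such as d=0.14 falls below the 0.2 benchmark.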
Table 3 Random effects model (Cohen's d) of the MSF domains with personnel ratings/academic performance (groups C and D)

MSF domain measure | Studies included (number of outcomes) | Sample size | Personnel rating differences* | Studies included (number of outcomes) | Sample size | MSF with different global measurement**
Professional | 19 (82) | 12,415 | 0.56 (0.44-0.67) | 3 (6) | 543 | 1.42 (0.72-2.12)
Clinical competence | 24 (75) | 12,720 | 0.60 (0.49-0.72) | 3 (4) | 614 | 1.34 (0.65-2.05)
Communication | 20 (76) | 11,280 | 0.56 (0.42-0.67) | 3 (6) | 603 | 1.35 (0.71-1.99)
Manager | 13 (38) | 6,089 | 0.60 (0.45-0.74) |  |  | 
Interpersonal relationships | 23 (74) | 11,660 | 0.54 (0.44-0.64) | 1 (1) | 32 | 1.43 (0.87-2.00)

Notes: *Effect size combined between differences in personnel ratings (ie, resident versus faculty, specialist versus consultant); 3 5,78 **effect sizes combined between MSF with standardized measures (eg, global ratings, OSPE). 7,24,27,40

Abbreviations: MSF, multi-source feedback; OSPE, Objective Structured Practical Examination.



The effect size differences between physician/surgeon
performance on two occasions (time 1/time 2) ranged from
d=0.23 (95% CI 0.13-0.33) for the communication skills
domain to d=0.90 (95% CI 0.74-1.10) for the interpersonal
relationship domain measure.

The differences in ratings of physician/surgeon performance
on MSF between different assessor groups (self-assessments,
medical colleagues, consultants, patients, and coworkers)
showed "medium" effect size differences that ranged from
d=0.50 (95% CI 0.47-0.52) for the interpersonal relationship
domain to d=0.57 (95% CI 0.55-0.60) for the professionalism
and clinical competence domains. These results are consistent
with findings from other assessment methods such as the
mini-clinical evaluation exercise (mini-CEX). Comparisons of
different raters in the mini-CEX have shown that, in comparison
with faculty evaluators, residents tend to be more lenient and
score trainees higher on in-training evaluation checklists. 25,26
In our study of the MSF, we found that physicians and surgeons
consistently rated themselves lower than did other assessor
groups. 23 In addition, patients and coworkers typically rated
physicians/surgeons more leniently than did other raters, such
as medical colleagues or consultants.

The MSF showed evidence of criterion-related validity
when compared with other performance examination
measures (eg, global examination, OSPE, OSCE). We found
"large" combined effect sizes, ranging from d=1.28
(95% CI 1.15-1.41) for the communication skills domain
to d=1.43 (95% CI 0.87-2.00) for the interpersonal
relationship domain.

The construct-related and criterion-related validity of
MSF was supported by the findings outlined within the studies
included in one or more of the four group comparisons.
As illustrated in the forest plots for the professionalism
domain in group C, not all of the reported differences between
personnel ratings were found to be statistically significant.
When the outcomes from the 19 different studies were combined,
however, we found a significant combined random-effects
size of d=0.56 (95% CI 0.44-0.69).

In general, the findings of this meta-analysis show
"medium" combined effect sizes for the construct-related
and criterion-related validity of the five main MSF domains
identified. Although different questionnaires and different
numbers of items were used in MSF across different specialties,
they were found to consistently measure similar
domains of physician/surgeon performance. 15 This feedback
process, which uses multiple questionnaires completed by
different types of raters, provides a more comprehensive
evaluation of clinical practice than can typically be provided
by one or a few sources. 1

Strengths and weaknesses of the study 

There are limitations to this meta-analysis. Because we
were interested in determining the construct-related and
criterion-related validity of MSF as a method for physician/
surgeon evaluation, the way the evaluation tool was used
varied across studies from a research design perspective.
In addition, there was variability in the performance domains
measured and in the number of items used to measure each
domain depending on the MSF instrument used (ie, ranging from
four items to 60 items), the raters used (ie, self, patients,
medical colleague, coworker), and whether or not the MSF
was being compared with other clinical skill measures (eg,
OSCE). To overcome this limitation, the more conservative
random-effects size analysis was performed to account for
the heterogeneity between the studies, as indicated
by the significant values obtained using the Cochran Q
test. Nevertheless, we were unable to undertake subsequent
subgroup analyses to determine where there may have been
between-study differences because these data (eg, sex, age
of participant) were rarely reported. Although some of
the studies had small sample sizes such as six 14 and seven
participants, 27 this was in part compensated by the 40 and
eight raters who completed the questionnaire, respectively,
on each of the participants in these studies. To achieve some
control over the quality of the studies that were included in
this meta-analysis, only papers that had been published in
refereed journals were selected.
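
The relationship between the Cochran Q statistic and the random-effects estimate can be sketched as follows. The sketch assumes the DerSimonian-Laird estimator of between-study variance, which is one common choice and is not confirmed here as the estimator used in this analysis; the function name and the example inputs are hypothetical.

```python
import math


def random_effects(effects, variances):
    """Cochran Q plus a DerSimonian-Laird random-effects pooled estimate.

    effects: per-study effect sizes (d); variances: their sampling variances.
    Requires at least two studies.
    """
    w = [1 / v for v in variances]
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    # Cochran Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, floored at zero
    # Random-effects weights add tau2 to every study's variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * d for wi, d in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {
        "Q": q,
        "df": df,
        "tau2": tau2,
        "pooled": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
    }


if __name__ == "__main__":
    # Hypothetical effect sizes and variances, for illustration only
    print(random_effects([0.2, 0.5, 0.9], [0.01, 0.02, 0.015]))
```

Q is compared against a chi-square distribution with df degrees of freedom. When Q exceeds df, the estimated between-study variance is positive, so the random-effects interval is wider than the fixed-effect interval, which is why the approach is described as more conservative.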

Implications for clinicians 
and policymakers 

Certain characteristics of health professionals, such as clinical
skills, personal communication, and client management,
combined with improved performance, can be assessed using
MSF. 8 MSF is a unique form of assessment that has been
shown to have both construct-related and criterion-related
validity in assessing a multitude of clinical and nonclinical
performance domains. In addition, MSF has been shown to
enhance changes in clinical performance, 15 communication
skills, 7 professionalism, 7 teamwork, 28 productivity, 29 and
the building of trusting relationships with patients. 30

Consequently, MSF has been adopted and used extensively
as a method for assessment of a variety of domains identified
by medical education programs and licensing bodies in the
UK, Canada, Europe, and other countries. Although
MSF has gained widespread acceptance, the literature has 
raised a number of concerns about its implementation and 
its validity. Therefore, the availability of evidence to support 
the validity of the process and the instruments used to date 
is of crucial importance to enable policymakers to make the 
decision to implement MSF within their own programs or 
organizations. 

Conclusion and future research 

Although MSF appears to be adequate for assessment of
a variety of nontechnical skills, this approach is limited by
the ability of peers or medical colleagues to assess
aspects of clinical skills competence that reflect physicians'/
surgeons' knowledge and non-cognitive behavior. In particular,
as part of the process of assessing clinical performance,
other methods such as procedures-based assessment or the
OSCE should be used in conjunction with the peer MSF
questionnaire to ensure accurate assessment of these specific
skills.

We are faced with the challenge of ensuring that use of
MSF for assessment of physicians and surgeons in practice
is reliable and valid. As shown above, MSF has proved to
be a useful method for assessing the clinical and nonclinical
skills of physicians/surgeons in practice with clear evidence
of construct and criterion-related validity. Although MSF is
considered to be a useful assessment method, it should not
be the only measure used to assess physicians and surgeons
in practice. Other reliable and valid methods should be used
in conjunction with MSF, in particular to assess procedural
skills performance and to overcome the limitation of using
a single measure.

Future research should replicate and extend some of the
empirical findings, especially the evidence for criterion-related
validity. Criterion-related validity studies looking at correlations
between direct observations of behavior or performance
and MSF scores are required to add further evidence of
validity. Future research on the various MSF instruments
available may well include confirmatory factor analysis,
which provides stronger construct validity evidence than the
principal component factor analyses conducted currently. 31
In addition, MSF assessments are entirely questionnaire-based
and rely on the judgment of and inference by the
assessors and respondents, which are subject to a variety
of biases and heuristics. Therefore, generalizability theory
should be used in future studies to determine potential
sources of measurement error that can occur due to use of
different assessors and specialties, as well as the characteristics
of the respondents themselves.